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Abstract 

Inference and learning of graphical models 
are both well-studied problems in statistics 
and machine learning that have found many 
applications in science and engineering. How¬ 
ever, exact inference is intractable in general 
graphical models, which suggests the prob¬ 
lem of seeking the best approximation to a 
collection of random variables within some 
tractable family of graphical models. In this 
paper, we focus on the class of planar Ising 
models, for which exact inference is tractable 
using techniques of statistical physics. Based 
on these techniques and recent methods for 
planarity testing and planar embedding, we 
propose a simple greedy algorithm for learn¬ 
ing the best planar Ising model to approx¬ 
imate an arbitrary collection of binary ran¬ 
dom variables (possibly from sample data). 
Given the set of all pairwise correlations 
among variables, we select a planar graph 
and optimal planar Ising model defined on 
this graph to best approximate that set of 
correlations. We demonstrate our method in 
simulations and for the application of model¬ 
ing senate voting records. 


1 Introduction 


Graphical models are widely used to represent the sta¬ 


tistical relations among a set of random variables (Lau- 


ritzen 


1996 MacKay, 2003). Nodes of the graph cor¬ 


respond to random variables and edges of the graph 
represent statistical interactions among the variables. 
The problems of inference and learning on graphical 
models arise in many practical applications. The prob¬ 
lem of inference is to deduce certain statistical proper¬ 
ties (such as marginal probabilities, modes etc.) of a 
given set of random variables whose graphical model is 
known. It has wide applications in areas such as error 
correcting codes, statistical physics and so on. The 


problem of learning on the other hand is to deduce 
the graphical model of a set of random variables given 
statistics (possibly from samples) of the random vari¬ 
ables. Learning is also a widely encountered problem 
in areas such as biology, anthropology and so on. 


The Ising model , a class of binary-variable graphical 
models with pairwise interactions, has been studied 
by physicists as a simple model of order-disorder tran¬ 
sitions in magnetic materials ( Onsager[ 1944). Re¬ 
markably, it was found that in the special case of an 
Ising model with zero-mean { —1,+1} binary random 
variables and pairwise interactions defined on a planar 
graph, calculation of the partition function (which is 
closely tied to inference) is tractable, essentially reduc¬ 
ing to calculation of a matrix determinant (|Kac and 
Ward) 1952| Sherman] 1960| |Kasteleyn| |1963| |Fisher[ 
1966). These methods have been used in machine 


learning (Schraudolph and Kamenetsky, 

2008 Glober- 

son and Jaakkola, 2007 

)• 


We address the problem of approximating a collec¬ 
tion of binary random variables (given their pairwise 
marginal distributions) by a zero-mean planar Ising 
model. We also consider the related problem of se¬ 
lecting a non-zero mean Ising model defined on an 
outer-planar graph (these models are also tractable, 
being essentially equivalent to a zero-field model on a 
related planar graph). 


There has been a great deal of work on learning graph¬ 
ical models. Much of these have focused on learn¬ 


ing over the class of thin graphical models (Desh- 


pande et al. [ 2001 ] |Bach and Jordan) |2001| |Karger| 
and Srebro| 2001 j " Shahaf et ah, 2009) for which infer¬ 


ence is tractable by converting the model to a junction 
tree. The simplest case of this is learning tree models 
(treewidth one graphs) for which it is tractable to find 
the best tree model by reduction to a max-weight span¬ 
ning tree problem ( [Chow and Lin , 1968). However, the 
problem of finding the best bounded-treewidth model 


is NP-hard for treewidths greater than two (Karger 


and Srebro, 2001), and so heuristic methods are used 


to select the graph structure (Deshpande et ah, 2001 


Karger and Srebro, 2001). Another popular method is 
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to use convex optimization of the log-likelihood penal¬ 
ized by t\ norm of parameters of the graphical model 


so as to promote sparsity (Banerjee et al. 2008 Lee 


et ah, 2006). To go beyond low-treewidth graphs, such 


methods either focus on Gaussian graphical models 
or adopt a tractable approximation of the likelihood. 
Other methods learn only the graph structure itself 
(Ravikumar et ah, 2010 Abbeel et ah, 2006) and are 
often able to demonstrate asymptotic correctness of 
this estimate under appropriate conditions. 


In contrast to existing approaches, this paper explores 
planarity as an alternative restriction on the model 
class to both make learning tractable and to offer 
a qualitatively different graph topology in which the 
number of edges learned is linear in the number of 
variables. 


2 Preliminaries 

In this section, we develop our notation and briefly 
review the necessary background theory. 


2.1 Divergence and Likelihood 

Suppose we want to calculate how well a proba¬ 
bility distribution Q approximates another probabil¬ 
ity distribution P (on the same sample space y). 
For any two probability distributions P and Q on 
some sample space %, we denote by D(P, Q ) the 
Kullback-Leibler divergence (or relative entropy) be- 

tween P and Q as D(P,Q) = Y^xex log CM' 
The log-likelihood function is defined as LL(P 1 Q ) = 
P( x ) l°g Q(x). The probability distribution in a 
family T that maximizes the log-likelihood of a proba¬ 
bility distribution P is called the maximum-likelihood 
estimate of P in P, and this is equivalent to the 
minimum-divergence projection of P to P, so that 
P T = arg maxg e jT LL(P, Q) = arg min Q6J - D(P, Q). 


2.2 Graphical Models and The Ising Model 

We will be dealing with binary random variables 
throughout the paper. We write P(x) to denote the 
probability distribution of a collection of random vari¬ 
ables x = (xi,...,x n ). Unless otherwise stated, we 
work with undirected graphs G = (V,E) with vertex 
(or node) set V and edges {i,j} G E C (^). For ver¬ 
tices z, j G V we write G + ij to denote the graph 
(V,E U {z, j}). A pairwise graphical model is a prob¬ 
ability distribution P(x) = P(x i,...,x n ) that is de¬ 
fined on a graph G = (U, E) with vertices V = {1,.., n} 


as 

P(x) OC II ^iA X ^ X o) 

iev { i,j}€E 

oc exp £ fi(Xi)+ ^2 fij( x i,Xj) 

where ipij > 0 are non-negative node and edge com¬ 
patibility functions. For positive z/>’s, we may also rep¬ 
resent P(x) as a Gibbs distribution with potentials 
fi = logV’i and fij = log ipij. 

Definition 1. An Ising model on binary random vari¬ 
ables x = (xi,...,x n ) and graph G = (V,E) is the 
probability distribution defined by 



where Xi G { — 1,1}. The partition function Z(6) 
serves to normalize the probability distribution. 


Formally, this defines an exponential family Pq(x) = 


exp {0 T (f)(x) — <£>(#)} (Barndorff-Nielsen 


1979 


Wain- 


wrigh tlmd Jordan] 2008) based on sufficient statistics 


(f>i(x) = Xi,i G V) and = XiXj,{i,j} G E), 

parameters (0^,z G V) and G E) and mo¬ 

ment parameters (fii = E [xf\,i G V) and ( fiij = 
E [xiXj\, {i,j} G E). The function &(0) = log Z(0) 
is a convex function of 0 and has the moment generat¬ 
ing properties: V < F(6 > ) = E o[(j){x)] = fi and V * 2 $(6 > ) = 
Eg[((f>(x) - H){(j>{x) ~ 


In fact, any pairwise graphical model among binary 
variables can be represented as an Ising model: 


2 ^ v Xjfi (xf) T ^ ^ ^ ^ ^ %ifij(%iiXj ), 

Xi {i,j}eE Xi,Xj 

Oij = 4 ^ ^ XiXjfij(Xi,Xj). 

Xi,Xj 


The moments can be computed as: yiq = XiP(xf) 
and piij = Yh Xi , Xj x iXjP(xi,Xj). Inversely, the 
marginals are computed by: 


E(xf) 2 (1 T 

P(Xi,Xj) = \(l + pbiXi + IXjXj + flijXiXj). 

An Ising model is said to be zero-field if 6i = 0 for all 
i G V. It is zero-mean if fii = 0 (P(xi = ±1) = |) 
for all i G F. The Ising model is zero-field if and only 
if it is zero-mean. Although the zero-field assumption 
appears very restrictive, a general Ising model can be 
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represented as a zero-field model by adding one auxil¬ 
iary variable node connected to every other node of the 
graph (Globerson and Jaakkola, 2007). The parame¬ 
ters and moments of the two models are then related 
as follows: 


Proposition 1. Consider the Ising model on G = 
(V, E) with V = {1,..., n}, parameters {0i} and {Oij}, 
moments {fii} and {Lij} and partition function Z. Let 
G = (U, E) denote the extended graph based on nodes 
V = VU{n+l} with edges E = EU{{i, n+ 1}, i eV}). 
We define a zero-field Ising model on G with param¬ 
eters {Oij}, moments {fiij} and partition function Z. 
If we set the parameters according to 


Oij = 

then Z = 2 Z and 

Lij = 


f 0i if j =n- hi 
| 0^ otherwise 


f Li if j = n+ 1 
| fiij otherwise 


Thus, inference on the corresponding zero-field Ising 
model on the extended graph G is equivalent to infer¬ 
ence on the (non-zero-field) Ising model defined on G. 
Proof given in the Supplement. 


2.3 Inference for Planar Ising Models 


A graph is planar if it may be embedded in the plane 
without any edge crossings. It is known that any pla¬ 
nar graph can be embedded such that all edges are 
drawn as straight lines. The motivation for our paper 
is the following result on tractability of inference for 
the planar zero-field Ising model. 


Theorem 1 . (Kac and Ward , 1952; Sherman\ \ 1960\ 
Loebl\ 2010]) Let G be a planar graph with specified 
straight-line embedding in the plane and let <f>ijk £ 
[—7r, Ttt] denote the clockwise rotation between the di¬ 
rected edges ( i,j ) and (j,k). We define the matrix 
W £ £^\e\x 2 \e\ i n g exe g fry directed edges of the graph 
as follows: W = AD where D is the diagonal matrix 
with Dij^j = tanh Oij = Wij and 


( exp(iv / -T j = k and i ± l 
\ 0, otherwise 


Then, the partition function of the zero-field planar 
Ising model is given by the Kac- Ward determinant for¬ 
mula: 

Z = 2 n ( cosh% j det (I-W)* 

M iJ}eE ' 


Another related method for computing the Ising model 
partition function is based on counting perfect match¬ 
ings of planar graphs (Kasteleyn, 1963 Fisher, 1966). 


Thus, calculating the partition function reduces to cal¬ 
culating the determinant of a matrix; therefore, using 
the generalized nested dissection algorithm to exploit 


culations is 0(n 3 / 2 ) <|Lipton et al. 

1979 

Lipton and 

Tar j an 

1979; Galluccio et al. 2000 

). Thus, inference 


of the zero-field planar Ising model is tractable and 
scales well with problem size. 


The gradient and Hessian of the log-partition function 
<1 ?(0) = log Z(0) can also be calculated efficiently from 
the Kac-Ward determinant formula. Derivatives of 
<1 >(0) recover the moment parameters of the exponen¬ 
tial family model as V&(0) = E e[4>\ = L flBarndorff- 


Nielsen[ 1979 Wainwright and Jordan, 20Q8| ). Thus, 


inference of moments (and node and edge marginals) 
are tractable for the zero-field planar Ising model. 


Proposition 2. Let fi = V$(<9) ; H = V 2 <f>(<9). Let 
S = {I- W)~ X A and T = (/ + P)(S o S T ){I + P T ) 
where A and W are defined as in Theorem 1, o denotes 
the element-wise product and P is the permutation ma¬ 
trix swapping indices of directed edges (i,j) and (j,i). 
Then, 


Lij — w ij 2(1 W ij)(^ij,ij + Sjiji) 

_ f 1 — Liji ij = kl 

,J,kl ~ \ -^(1 - Wij)Tij,ki( 1 - Wfcz), otherwise 


Calculating the full matrix S requires 0(n 3 ) calcula¬ 
tions. However, to compute just the moments fi only 
the diagonal elements of S are needed. Then, using 
the generalized nested dissection method, inference of 
moments (edge-wise marginals) of the zero-field Ising 
model can be achieved with complexity 0(n 3 / 2 ). Com¬ 
puting the full Hessian is more expensive, requiring 
0(n 3 ) calculations. 


Inference for Outer-Planar Graphical Models 

We emphasize that the above calculations require both 
a planar graph G and a zero-field Ising model. Us¬ 
ing the graphical transformation of Proposition 1, the 
latter zero-field condition may be relaxed but at the 
expense of adding an auxiliary node connected to all 
the other nodes. In general planar graphs G, the new 
graph G may not be planar and hence may not admit 
tractable inference calculations. However, for the sub¬ 
set of planar graphs where this transformation does 
preserve planarity inference is still tractable. 

Definition 2. A graph G is said to be outer-planar if 
there exists an embedding of G in the plane where all 
the nodes are on the outer face. 

In other words, the graph G is outer-planar if the ex¬ 
tended graph G (defined by Proposition 1) is planar. 
Then, from Proposition 1 and Theorem 1 it follows 
that: 






































































Learning Planar Ising Models 


Proposition 3. (Globerson and Jaakkola\ ^2007) The 
partition function and moments of any outer-planar 
Ising graphical model (not necessarily zero-field) can 
be calculated efficiently. Hence, inference is tractable 
for any binary-variable graphical model with pairwise 
interactions defined on an outer-planar graph. 


This motivates the problem of learning outer-planar 
graphical models for a collection of (possibly non-zero 
mean) binary random variables. 


3 Learning Planar Ising Models 

This section addresses the main goals of the paper, 
which are two-fold: 

1. Solving for the maximum-likelihood Ising model 
on a given planar graph to best approximate a 
collection of zero-mean random variables. 

2. How to select (heuristically) the planar graph to 
obtain the best approximation. 

We address these problems in the following two subsec¬ 
tions. The solution of the first problem is an integral 
part of our approach to the second. Both solutions are 
easily adapted to the context of learning outer-planar 
graphical models of (possibly non-zero mean) binary 
random variables. 


3.1 Maximum-Likelihood Parameter 
Estimation 


Newton’s Method We solve this unconstrained 
convex optimization problem using Newton’s method 
with step-size chosen by back-tracking line search 


(Boyd and Vandenberghe 2004). This produces a se¬ 
quence of estimates #dTcai cu i a ted as follows: 

Q(t+1) = 0(t) + A(0W)-I(p(0(t)) - n) 

where /i(flW) and H(0 <L> ) are calculated using Propo- 
sition 2 and X t G (0,1] is a step-size parameter cho¬ 
sen by backtracking line search (see Boyd and Van¬ 
denberghe (2004): Chapter 9, Section 2 for details). 


The per iteration complexity of this optimization is 
0(n 3 ) using explicit computation of the Hessian at 
each iteration. This complexity can be offset some¬ 
what by only re-computing the Hessian a few times 
(reusing the same Hessian for a number of iterations), 
to take advantage of the fact that the gradient compu¬ 
tation only requires 0(n 2 ) calculations. As Newton’s 
method has quadratic convergence, the number of it¬ 
erations required to achieve a high-accuracy solution 
is typically 8-16 iterations (essentially independent of 
problem size). We estimate the computational com¬ 
plexity of solving this convex optimization problem as 
roughly 0(n 3 ). 


3.2 Greedy Planar Graph Selection 

We now consider the problem of selection of the planar 
graph G to best approximate a probability distribu¬ 
tion P{x) with pairwise moments j±ij = E p[xiXj\ given 
for all i, j G V. Formally, we seek the planar graph 
that maximizes the log-likelihood (minimizes the di¬ 
vergence) relative to P: 


Maximum-likelihood estimation over an exponential 
family is a convex optimization problem based on the 
log-partition function 4>($). In the case of the zero- 
field Ising model defined on a given planar graph it is 
tractable to compute $(0) via a matrix determinant 
described in Theorem 1. Thus, we obtain an uncon¬ 
strained, tractable, convex optimization problem for 
the maximum-likelihood zero-field Ising model on the 
planar graph G to best approximate a probability dis¬ 
tribution P{x): 

ma x{/i t 6 — <f>(0)} 

max — log cosh Oij) — \ log det(/ — W(9)) 

1 ^ ij J 

Here, f±ij = E p[x{Xj\ for all edges {i,j} G G and 
the matrix W{6) is as defined in Theorem 1. If P 
represents the empirical distribution of a set of inde¬ 
pendent identically-distributed (iid) samples {#W, s = 
1,... , S'} then { Hij } are the corresponding empirical 
moments \ x^\ 


G — arg maxLL(P, Pq) = arg max max LL[P , Q ) 
Gev v Gev v Q^o 

where Vy is the set of planar graphs on the vertex set 
V, Tq denotes the family of zero-field Ising models 
defined on graph G and Pq = argmaxg e:FG LL[P , Q ) 
is the maximum-likelihood (minimum-divergence) ap¬ 
proximation to P over this family. 

We obtain a heuristic solution to this graph selection 
problem using the following greedy edge-selection pro¬ 
cedure. The input to the algorithm is a probabil¬ 
ity distribution P (which could be empirical) on n 
binary { — 1,1} random variables. In fact, it is suf¬ 
ficient to summarize P by its pairwise correlations 
f±ij — E p[xiXj\ on all pairs i,j G V. The output is a 
maximal planar graph G and the maximum-likelihood 
approximation 6q to P in the family of zero-field Ising 
models defined on this graph. A maximal planar graph 
is a planar graph for which no new edge can be added 
that would maintain planarity. 

The algorithm starts with an empty graph and then 
sequentially adds edges to the graph one at a time so as 
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Algorithm 1 GreedyPlanarGraphSelect(P) 

1 : G = 0, Oq = 0 

2: for k = 1 : 3n — 6 do 

3 : A = {{i,j} c £G,G + ij £ V v } 

4: jl A = {flij = E e G [xiXj\, {ijj} e A} 

5: G^-GU argmaxD(P e , P e ) 

eG A 

6 : Og = PlanarIsing(G, P) 

7: end for 


> Add edges until maximal planar graph reached 
> Set of candidate edges that preserve planarity 
t> Compute pairwise correlations 
> Select edge that maximizes improvement in log-likelihood 

> Compute maximum-likelihood parameters for G 


to (heuristically) increase the log-likelihood (decrease 
the divergence) relative to P as much as possible at 
each step. Here is a more detailed description of the 
algorithm along with estimates of the computational 
complexity of each step: 


Thus, we greedily select the next edge {i, j} to 
add so as to maximize this lower-bound on the 
improvement measured by the increase on log- 
likelihood (this being equal to the decrease in KL- 
divergence). 


• Line 3. First, we enumerate the set A of ah edges 
one might add (individually) to the graph while 
preserving planarity. This is accomplished by an 
0(n 3 ) algorithm in which we iterate over ah pairs 
{i, j} 0 G and for each such pair we form the 
graph G + ij and test planarity of this graph us¬ 


ing known 0(n) algorithms (Chrobak and Payne 


1995). 


• Line 4- Next, we perform tractable inference cal¬ 
culations with respect to the Ising model on G to 
calculate the pairwise correlations \±ij for ah pairs 
{i, j} G A. This is accomplished using 0(n 3 / 2 ) in¬ 
ference calculations on augmented versions of the 
graph G. For each inference calculation we add 
as many edges to G from A as possible (setting 
0 = 0 on these edges) while preserving planarity 
and then calculate all the edge-wise moments of 
this graph using Proposition 2 (including the zero- 
edges). This requires at most 0(n) iterations to 
cover ah pairs of A, so the worst-case complex¬ 
ity to compute ah required pairwise moments is 
0(n 5 / 2 ). 


• Line 5. Once we have these moments, which 
specify the corresponding pairwise marginals of 
the current Ising model, we compare these mo¬ 
ments (pairwise marginals) to those of the input 
distribution P by evaluating the pairwise KL- 
divergence between the Ising model and P. As 
seen by the following proposition, this gives us 
a lower-bound on the improvement obtained by 
adding edge {i,j} (see Supplement for proof): 

Proposition 4. Let Pq and PG+ij be projections 
of P on G and G + ij respectively. Then, 


D(P, P g )-D(P, P G +ij) > D(P(xi,Xj),P G (xi,Xj)) 


where P(xi,Xj) and P G (xi,xj) represent the 
marginal distributions on x^xj of probabilities P 
and Pg respectively. 


• Line 6. Finally, we calculate the new maximum- 
likelihood parameters Og on the new graph G <— 
G + ij. This involves solving the convex opti¬ 
mization problem discussed in the preceding sub¬ 
section, which requires 0(n 3 ) complexity. This 
step is necessary in order to subsequently calcu¬ 
late the pairwise moments jl which guide further 
edge-selection steps, and also to provide the final 
estimate. 

We continue adding one edge at a time until a maximal 
planar graph (with 3n — 6 edges) is obtained. Thus, 
the total complexity of our greedy algorithm for planar 
graph selection is 0 (n 4 ). 


Non-Maximal Planar Graphs Since adding an 
edge always improves the log-likelihood, the greedy 
algorithm always outputs a maximal planar graph. 
However, this might lead to over-fitting of the data 
especially when the input probability distribution is 
an empirical distribution. Note that at 3n — 6 edges, 
the maximal planar graph is sparse and our empirical 
work indicates that over-fitting is often not an issue. 
In the case that over-fitting is a concern, we could 
terminate the algorithm when adding an edge to the 
graph would only improve the log-likelihood by less 
than some threshold 7 . An experimental search can be 
performed for a suitable value of this threshold (e.g. 
so as to minimize some estimate of the generalization, 
such as in cross validation methods (Zhang, 1993])). 
Or, one could use some heuristic value for 7 based on 
the number of samples such as Akaike’s information 
criterion (AIC) or Shwarz’s Bayesian information cri¬ 
terion (BIC) ( | Akaike , 1974 Schwarz 1978). 


Outer-Planar Graphs and Non-Zero Means 

The greedy algorithm returns a zero-held Ising model 
(which has zero mean for all the random variables) de¬ 
fined on a planar graph. If the actual random variables 
are non-zero mean, this may not be desirable. For 
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this case we may prefer to exactly model the means of 
each random variable but still retain tractability by re¬ 
stricting the greedy learning algorithm to select outer- 
planar graphs. This model faithfully represents the 
marginals of each random variable but at the cost of 
modeling fewer pairwise interactions among the vari¬ 
ables. 

This is equivalent to the following procedure. First, 
given the sample moments {/q} and {/i^} we convert 
these to an equivalent set of zero-mean moments jil on 
the extended vertex set V = V U {n + 1} according 
to Proposition 1. Then, we select a zero-mean planar 
Ising model for these moments using our greedy algo¬ 
rithm. However, to fit the means of each of the original 
n variables, we initialize this graph to include all the 
edges {z, n + 1} for all z E V (requiring that these are 
present in our final estimate of the graph G ). After 
this initialization step, we use the same greedy edge- 
selection procedure as before. This yields the graph G 
and parameters 6q. Lastly, we convert back to a (non¬ 
zero field) Ising model on the subgraph of G defined on 
nodes V, as prescribed by Proposition 1. The resulting 
graph G and parameters Oq is our heuristic solution 
for the maximum-likelihood outer-planar Ising model. 

We remark that it is not essential to choose between 
the zero-field planar Ising model and the outer-planar 
Ising model. The greedy algorithm may instead select 
something in between—a partial outer-planar Ising 
model where only nodes of the outer-face are allowed 
to have non-zero means. This is accomplished sim¬ 
ply by omitting the initialization step of adding edges 
{z, n + 1} for all iGk. 

4 Experiments 

We present the results of experiments evaluating our 
algorithm on known models with simulated data to 
evaluate the correctness of the learned models. We 
generate two styles of known Ising models: a 7 x 7 
grid (n = 49) with zero-field; and a 12-node outer pla¬ 
nar model where nodes have non-zero mean; shown in 
Fignres [l (a)| and [l(d)| The edge parameters are cho¬ 
sen uniformly randomly between —1 and 1 with the 
condition that the absolute value be greater than a 
threshold (chosen to be 0.05) so as to avoid edges with 
negligible interactions. We use Gibbs sampling to ob¬ 
tain samples from this model and calculate empirical 
moments from these samples which are then passed 
as input to our algorithm. We run 10 trials of ran¬ 
domly generated edge parameters and data samples. 
Though our algorithm can run on graphs with many 
more nodes, we choose small examples here to illus¬ 
trate the result effectively. On the outer planar model, 
we ensure that the first moments of all the nodes are 


satisfied by starting our algorithm with the auxiliary 
node connected to all other nodes. 


As the planar learning algorithm adds edges to the 
model, the likelihood of the training data is guaran¬ 
teed to increase. We assess how adding edges affects 
the likelihood of out-of-sample test data. Figures [l(b)| 
and |l(e)| demonstrate that likelihood on test sets gen¬ 
erally increases as edges are added up to the maximal 
planar graph. The true number of edges in each syn¬ 
thetic graph is marked with a vertical dotted line. On 
the smallest datasets (100 samples) the out-of-sample 
performance begins to degrade, a sign of over-fitting 
the training data; yet the likelihood of the maximal 
graph is not significantly worse than the best likeli¬ 
hood obtained (with fewer edges). 


We also compare against a Markov random field 
(MRF) learning algorithm for binary data (Schmidt 
et~ak| [2008]), as implemented in the undirected graph¬ 


ical model learning Matlab package, UGMLearrj^] 
UGM is not restricted to learning planar graphs. The 
objective is optimized via projected gradient descent. 
We try two versions of the objective function, one us¬ 
ing pseudo-likelihood and the other using loopy belief 
propagation for inference. UGM employs a regulariza¬ 
tion parameter which we set using two different meth¬ 
ods. First, we used the tuning method on validation 
data as detailed in Schmidt et al. (2008). That is, we 
split the data into two parts, train on half the data us¬ 
ing 7 different values for the parameter, measure the 
data likelihood of the other half of the data and vice- 
versa, then select the parameter value that maximizes 
the validation data likelihood across both folds. The 
learned model is trained on the full training data with 
the tuned regularization parameter value. The sec¬ 
ond method for setting the regularization parameter 
we call the oracle method, where we select the learned 
model at the true number of edges, &, in our known 
models. For UGM, we set the regularization parame¬ 
ter via linear search until k edges are learned. 


We compare the likelihood of test data from the vari¬ 
ous learned models in Figures [l(c)| and |l(f)| For com¬ 
parison, we selected the maximal planar graph that 
our algorithm learns, Planar maximal; as well as the 
planar graph learned if the algorithm were stopped 
when the true number of edges are learned, Planar 
oracle. We compare against UGM pseudo tuned and 
UGM loopy tuned, both of which tune the regulariza¬ 
tion parameter on validation data; but the former uses 
pseudo-likelihood in learning and the latter uses loopy 
belief propagation. The tuning method is the most 
common way of selecting the regularization parame¬ 
ter, but tends to produce relatively dense graphs. For 


1 http:/ / www.cs.ubc.ca/~murphyk/Software/LlCRF 
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(a) 7 x 7 grid 




(b) Loglikelihood versus number edges (c) Comparison of algorithms 





(d) Outer planar (e) Loglikelihood versus number edges (f) Comparison of algorithms 

Figure 1: Results on known models, (top row): 7x7 grid; and (bottom row): outer planar. Left column (a,d): true 
graph. Middle column (b,e): likelihood of learned planar graphs as edges are added; and the true number of edges is 
marked with a vertical dashed line. Right column (c,f): likelihood of test data for various algorithms; x-axis values are 
perturbed horizontally so that overlapping errorbars are visible. 


fair comparison, we also show the likelihood of UGM 
pseudo oracle and UGM loopy oracle; that is, the 
model with the known true number of edges. 

Figures |l(c)| and |l(f)| show that our greedy planar 
Ising model learning algorithm is at least as accu¬ 
rate and often better than the UGM learning algo¬ 
rithms on these inputs. As mentioned earlier, we see 
that Planar maximal and Planar oracle fit test data 
nearly equally well. On the outer planar model, UGM 
pseudo tuned performs nearly as well as our planar 
algorithm, yet on the larger grid model it performs 
quite poorly at the smaller sample sizes. UGM loopy 
tuned performs more consistently close to our planar 
algorithm, but it seems that loopy belief propagation 
performs worse at large sample sizes. 

On the largest dataset (10 5 samples) of the 7x7 grid 
model, UGM was aborted after running for 40 hours 
without reaching convergence on a single run, and so 
results are not available. 


5 Application: Modeling Correlations 
of Senator Voting 


We consider an interesting application of our algo¬ 
rithm to model correlations of senator voting follow¬ 
ing Banerjee et al. (2008). We use senator voting data 
from the years 2009 and 2010 to calculate correlations 


in the voting patterns among senators. A Yea vote is 
treated as +1 and a Nay vote is treated as —1. We 
also treat non-votes as —1, but only consider sena¬ 
tors who voted in at least | of the votes to limit bias. 
The data includes n = 108 variables and 645 samples. 
To accommodate the non-zero mean data we add an 
auxiliary node and allow the algorithm to select the 
connections between it and other nodes. We run a 
10-fold cross-validation, training on 90% of the data 
and measuring likelihood on the held-out 10% of data. 
Figure [2(b)] shows that the likelihood of test data in¬ 
creases as edges are added. We also show the likeli¬ 
hood of cross-validation test data for the UGM pseudo 
and UGM loopy algorithms for two different methods of 
choosing the value of the regularization parameter: (1) 
the value that produces the same number of edges as 
the maximal planar graph (at 318 edges); and (2) the 
value selected by tuning with validation data (at a vari¬ 
able number of edges, typically a dense graph). The 
likelihood of the sparse UGM models are significantly 
worse than the planar model. Only the UGM loopy al¬ 
gorithm at a very dense (nearly fully connected) graph 
has better fit to test data. 

The maximal planar graph learned from the full 
dataset, shown in Figure [2j conveys many facts that 
are already known to us. For instance, the graph shows 
Sanders with edges only to Democrats which makes 
sense because he caucuses with Democrats. Same is 
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Sanders (l-VT^ 
Harkin (D-IAL. 


Roberts (R-KS^ 
Hutchison (R-TXL, 


Isakson < 
Chambliss (R-GAL, 


Ensigrfi(iRsp|W|t 
Inhofe (R-OK , 



Number of edges learned 


(a) Learned planar graphical model representing the senator voting pattern (b) Comparison of algorithms 

Figure 2: Senator voting results, (a) Blue nodes represent Democrats, red no des represent Repu b licans and black 


nodes represents Independents. We use a force-directed graph drawing algorithm (Fruchterman and Reingold 1991). (b) 
Likelihood of holdout data versus the number of edges in the learned graph. Note tKe break irTthe x-axis, due to tuned 
UGM learning dense graphs. On the tuned UGM models, we indicate standard error on number of edges learned. 


the case with Lieberman. The graph also shows the 
senate minority leader McConnell well connected to 
other Republicans though the same is not true of the 
senate majority leader Reid. The learned UGM mod¬ 
els can be seen in the Supplement, and they show 
that the non-planar models are qualitatively different, 
learning one or two densely connected components. 

6 Conclusion and Future Work 

We provide a greedy heuristic to obtain the maximum- 
likelihood planar Ising model approximation to a col¬ 
lection of binary random variables with known pair¬ 
wise marginals. The algorithm is simple to imple¬ 
ment with the help of known methods for tractable ex¬ 
act inference in planar Ising models, efficient methods 
for planarity testing and embedding of planar graphs. 
Empirical results of our algorithm on sample data and 
on the senate voting record show that it is competitive 
with arbitrary (non-planar) graph learning. 

Many directions for further work are suggested by the 
methods and results of this paper. Firstly, we know 
that the greedy algorithm is not guaranteed to find the 
best planar graph. In the Supplement, we provide an 
enlightening counterexample in which the combination 
of the planarity restriction and greedy method prevent 
the correct model from being learned. That counterex¬ 


ample suggests strategies one might consider to further 
refine the estimate. One strategy would be to allow 
the greedy algorithm to prune edges which turn out to 
be less important once later edges are added. It would 
also be feasible to implement a multi-step greedy look¬ 
ahead search technique for selection of which edge to 
add (or prune) next. 

Another limitation is that our current framework only 
allows learning planar graphical models on the set of 
observed random variables and requires that all vari¬ 
ables are observed in each sample. One could imag¬ 
ine extensions of our approach to handle missing sam¬ 
ples or to try to identify hidden variables that were 
not seen in the data. This concept offers another av¬ 
enue to achieve a better fit to data that is not well- 
approximated by a planar graph among just the set 
of observed nodes, but might be well-approximated as 
the marginal distribution of a planar model with more 
nodes. 
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Supplementary Appendix 
A Proofs 


Proposition 1. Let the probability distributions cor¬ 
responding to G and G be P and P respectively and 
the corresponding expectations be E and E respec¬ 
tively. For the partition function, we have that 


= E exp ( E Qij x i* 

x v \{i,j}eE 

= ^]exp I x n+1 Y djXj + ^2 QijXiXj 

x v V {i,j}EE 

= ^2 exp I ^2 @i x i + ^2 

XV \iev {i,j}eE 

+ E exp -^2 0i Xi + E ^ 

x v \ iev {i,j}eE 

= 2 ^exp OiXi + E ^ XJ J 

XV \i£V 

= 2 Z 


where the fourth equality follows from the symmetry 
between —1 and 1 in an Ising model. 

For the second part, since P is zero-field, we have that 

E[xi] = 0 \/ieV 

Now consider any {i, j} G E. If x n +i is fixed to a value 
of 1, then the model is the same as original on V and 
we have 

E [xiXj | x n+1 = 1] = E [xiXj\ V {i,j} e E 

By symmetry (between —1 and 1) in the model, the 
same is true for x n +i = — 1 and so we have 

E [x{Xj\ 

= E [x^j | x n+1 = l]P(x n+1 = 1) 

+E[xiXj | x n+1 = -l]P(x n+1 = -1) 

= E [x{Xj\ 

Fixing x n+ i to a value of 1, we have 

E [xi | x n +i = 1] = E [xi\ V i e V 
and by symmetry 

E [xi | x n+1 = -1] = -E [xi\ V i G V 
Combining the two equations above, we have 
E[x 2 X n _|_i] 

= E [Xi | X n+1 = l\P(Xn+X = 1) 

+E[—Xi | x n+1 = -l\P(x n+1 = -1) 

= E[xi\ 

□ 


Proposition 2. From Theorem 1, we see that the log 
partition function can be written as 


<£>(#) = nlog 2 + log cosh 6i 

{iJ}eE 


- log det(/ — AD) 


where A and D are as given in Theorem 1. For the 
derivatives, we have 


=tanh% + il 

= tanh % - ±Tr ((/ - AD ) -1 AD',,) 

= Wij 2 ^ W'ij ) "1“ Sj j j { ) 


where D'-- is the derivative of the matrix D with re- 

l j 

spect to Oij. The first equality follows from chain rule 
and the fact that ViF = K~ l for any matrix K. Please 
refer to Boyd and Vandenberghe (2004) for details. 

For the Hessian, we have 


d 2 <f>( 0 ) _ i d 2 z(o) i fdz( 0 )\ 2 

do^ ~ z{ 0 ) de\. z(oy V 80^ J 

— 1 — 


For {i, j} 7 ^ {A:,/}, following Boyd and Vandenberghe 
( 2004| ), we have 


a 2 $( 6 >) 

dOijdOki 


= -|Tr (SD[jSD’ kl ) 

= 2 ^ ^V/) T $ji,klEkl,ji 

T Eij,lkElk,ij + Sji,i k Si k ji ) (1 - wh ) 


On the other hand, we also have 

T ijM = efjil + P)(S o S T )(I + P)e kl 
= (ejj + eji) T (S o S T )(e k i + eik) 
= (S o S T ) i j j ki + (S o S T )ijjk 
+(SoS T ) jiM + (SoS T ) ji , lk 

+ Sij ,lk$lk,ij T SjijkSlk : ji 


where e^- is the unit vector with 1 in the ij ^ position 
and 0 everywhere else. Using the above two equations, 
we obtain 

= -1(1 - w^)T ijM { 1 - wh) 

□ 


Proposition 4. The proof follows from the following 
steps of inequalities. 

D(P, P G ) = D(P, P G+ij ) + D(P G+ij , P G ) 

— D(P, P G+ ij)P 
D(Pg+ij (xi, Xj), P G (xi, xj))+ 
D(P G +ij(xv-ij), P G (xv-ij)) 

> D(P , P G +ij)-\- 
D(P G +ij (xi , Xj ), P G (xi , xj ))■+ 

E D(P, P G +ij)+ 
D(P(x i ,x j ),P G (x i ,x j )) 
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(a) Counter example original 



Figure 3: Example graphical models, (a) Counter ex¬ 
ample. (b) The recovered graphical model has one 
spurious edge {a, e} and one missing edge {c, d}. 


where the first step follows from the Pythagorean law 
of information projection (Amari et al. 1992), the 
second step follows from the conditional rule of rel¬ 
ative entropy (Cover and Thomas, 2006), the third 


step follows from the information inequality (Cover 


and Thomas, 2006) and finally the fourth step follows 


from the property of information projection to G + ij 
(Wainwright and Jordan! [2008). □ 


5 nodes) is not planar, one of the actual edges is missed 
in the output graph as shown in Figure [3(b)] 

C Example Application: UGM 
Learned Models 


For comparison to our planar learning algorithm, we 
provide the results of using the UGM MRF learn¬ 
ing algorithm on the senate voting data. For all 
figures, we use a force-directed graph drawing algo¬ 
rithm (Fruchterman and Reingold, 1991). Figure [4] 
presents the graph learned using pseudolikelihood, UGM 
pseudo, from the full dataset with the regularization 
parameter set to obtain the same number of edges 
as learned in the planar case (318 edges). Figure [5] 
presents the graph learned using pseudolikelihood, UGM 
pseudo tuned, from the full dataset after selecting the 
regularization parameter from cross-validation tuning. 
Figure [6] presents the graph learned using loopy belief 
propagation, UGM loopy, from the full dataset with the 
regularization parameter set to obtain 318 edges. The 
graph learned using UGM loopy tuned is not displayed 
because it is a nearly fully-connected graph providing 
no visual information. 


B Experiments: Counter Example 


The result presented in Figure [3] illustrates the fact 
that our algorithm does not always recover the ex¬ 
act structure even when the underlying graph is pla¬ 
nar and the algorithm is given exact moments as in¬ 
puts. This counterexample gives insight into how the 
greedy algorithm works. The basic idea is that graph¬ 
ical models can have nodes which are not neighbors 
but are more correlated than some other nodes which 
are neighbors. If the spurious edges corresponding to 
these highly correlated nodes are added early on in the 
algorithm, then the actual edges may have to be left 
out because of the planarity restriction. 


We define a zero-field Ising model on the graph in Fig¬ 
ure |3(a)| with the edge parameters as follows: 9 & c = 
Ocd = 9m = 0.1 and % = 1 for all the other edges. 
Figure 3(a) shows the edge parameters in the graph 
pictorially using the intensity of the edges - higher the 
intensity of an edge, higher the corresponding edge pa¬ 
rameter. With these edge parameters, the correlation 
between nodes a and e is greater than the correlation 
between any other pair of nodes. This leads to the 
edge between a and e to be the first edge added in the 
algorithm. However, since Kb (the complete graph on 



























Jason K. Johnson, Diane Oyen, Michael Chertkov, Praneeth Netrapalli 



Martine^Mek^ (D-MA^ 


Snowe (R-MEl 

Collins (R-MEj, 


Bayh (D-IN' 


Nelson (D-NE^ McCaskill (D^l 


Salazar (D-CO^ 



CliiWbki (D-MK^ Biden (D-DEgpecter (R-PA^ 


Byrd (D-WV^ 


Crapo (R- ID i 


Risch (R-IDJ 


Barrasso (R-Vef fci IgtteWMMg -MA^ 


Figure 4: Senate voting graph learned by UGM pseudo with 318 edges. Blue nodes represent Democrats, red 
nodes represent Republicans and black nodes represent Independents. 
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Figure 5: Senate voting graph learned by UGM pseudo tuned. Blue nodes represent Democrats, red nodes 
represent Republicans and black nodes represent Independents. 
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Figure 6: Senate voting graph learned by UGM loopy with 318 edges. Blue nodes represent Democrats, red nodes 
represent Republicans and black nodes represent Independents. 
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